DOCUMENT RESUME

ED 128 400                                             IB 005 56a

AUTHOR        Forbes, Dean
TITLE         The Use of Rasch Logistic Scaling Procedures in the
              Development of Short Multi-Level Arithmetic
              Achievement Tests for Public School Measurement.
PUB DATE      [Apr 76]
NOTE          19p.; Paper presented at the Annual Meeting of the
              American Educational Research Association (60th, San
              Francisco, California, April 19-23, 1976)

EDRS PRICE    MF-$0.83 HC-$1.67 Plus Postage.
DESCRIPTORS   *Achievement Tests; Elementary Education; *Elementary
              School Mathematics; Grade 7; Grouping (Instructional
              Purposes); *Grouping Procedures; *Individual
              Differences; Mathematics; Public Schools; Standard
              Error of Measurement; *Student Ability; Test
              Construction; Testing Problems; *Test Reliability
IDENTIFIERS   Rasch Item Calibration; *Rasch Model



ABSTRACT 

Rasch calibration permitted the development of short 
achievement tests that were economical in testing time, and could be 
developed in a series of difficulty levels to suit student individual 
differences. Furthermore, these tests were of adequate reliability 
for practical educational measurement when individual students were 
assigned to tests of appropriate difficulty level. A variety of test
placement strategies were considered and several were tried. Two 
formal procedures involving the use of a pretest screening tool for 
level assignment show promise of effectiveness but in the research 
described here tended to place many children in a test which was 
somewhat too difficult for them. The use of screening tests still is 
considered very promising although it is recommended that in the 
future criteria for test placement be modified so the students would 
be placed one, or perhaps two, levels lower than they were in the 
field test of these prototypes. It is further recommended that any 
students who get raw scores under 5, or over 25, immediately be 
retested with a more appropriate level to forestall the dramatic 
measurement error increases which occur when those limits are 
exceeded. (Author/BW) 
Introduction



In the general practice of public school testing many practical problems
must be dealt with. Two of the more pressing problems deal with the amount
of time required for achievement testing and the difficulty of assigning
each child a test with a difficulty level appropriate to his performance
capability. This latter point becomes especially critical as more attention
is paid to the debilitating and frustrating effects of presenting the
child with a test that is far too difficult, or the motivational problems
experienced by a child faced with a test which is far too easy. In both
circumstances the resulting information is of very questionable validity
and the emotional impact on the child is undesirable.

In an effort to improve the practical testing situation our school system,
two years ago, started with a hypothetical model for an improved achievement
test and began to explore various methodologies to find a procedure by
which such a test could be constructed. We envisioned a test which would
be parsimonious with respect to time, taking no more than 30 or approximately
40 minutes to administer. We also wanted a test which would occur in a
variety of levels of difficulty within a given grade so that every child could
take a test which was appropriate to his present level of functioning. This
made it mandatory that there be a defensible metric underlying the
multi-level tests which were planned.

Recent developments in the use of logistic function scaling to define item
and test performance, such as those presented by Rasch and championed by
Dr. Benjamin Wright, seemed particularly promising. For this reason a
project was undertaken to develop prototype multiple level seventh grade
mathematics achievement tests to fit the model described above. The process
by which the original item pool was calibrated, and by which the multiple
level tests were actually developed, is documented in the appendix of the
handout and will not be discussed at this time. Let it suffice, at the
moment, to point out that seven short tests were developed. In local
terminology they are called "level tests." Their content was arranged in
ascending order of difficulty within each test. Average difficulty between
levels (which were adjacent in difficulty) varied by a predetermined and
constant interval. This meant that the level tests, themselves, formed an
ascending series with respect to difficulty. They were developed within the
field of general mathematics and were equally divided, in terms of content,
between the conventional subtest areas of computation, concepts, and problem
solving. They were administered across the seventh grade of two out of three
sub-districts in the Portland, Oregon, school system (comprising in excess of
2,000 students) and the resulting data permit the study of a number of
problems which had previously been identified (or which arose as the project
proceeded).

At this time, preliminary information will be given with respect to a number
of these problems: (1) can short tests of this nature be competitive in
reliability with more traditional achievement tests? (2) is it practical, or
even possible, to assign a specific child to one out of several available
test difficulty levels? If such assignment to a specific difficulty level
can be done, what are the most effective procedures for so doing? (3) if it
is possible to devise a practical procedure for placing students in a test
of appropriate difficulty level, what degree of measurement error will be
involved? These problems will be discussed in sequence.
Reliability of Short Multi-Difficulty Level Tests



In discussing "level test" reliability attention will be paid to only a
portion of the available data. There were seven tests built at various levels
of difficulty but in the placement procedures used the student samples for
levels 1 and 2 (the two easiest levels) were sufficiently small that extended
analysis was not carried out. Level 7 was sufficiently difficult that it was
felt less appropriate to the grade level for which the tests were developed
(grade 7) than to the next grade level above.

The preliminary phases of analysis concentrated on levels 4, 5, and 6. Table 1
presents the first order Kuder-Richardson reliability coefficients for these
three levels along with correlations of the level tests with an incumbent,
conventional mathematics achievement test. It becomes readily apparent that
the reliabilities, in terms of correlation coefficient, are slightly lower
than would be desired. Correlations with the incumbent test are substantial
and are very consistent with that magnitude of correlation coefficient
commonly found between achievement tests.

                              Table 1

          Kuder-Richardson Reliabilities and Correlations
            with Incumbent Mathematics Achievement Tests
                   for Level Tests 4, 5, and 6.

          Level        KR20        Correlation with
                                   Incumbent Test

            4          .86           [illegible]
            5          .79           .72
            6          .81           .75

                       K = 30
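
For readers who wish to compute coefficients of the kind reported in Table 1, the following is a minimal sketch of Kuder-Richardson formula 20 in Python. The response matrix `toy` is invented purely for illustration and is not data from this study; the function name `kr20` is likewise hypothetical, and the population (divide by N) variance is an assumed convention.

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for rows of 0/1 item scores.

    `responses` holds one list per examinee, one 0/1 entry per item.
    """
    n_people = len(responses)
    k = len(responses[0])  # number of items (K in Table 1)

    # Variance of total raw scores, using the population (divide by N) form.
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people

    # Sum over items of p * q, where p is the proportion passing the item.
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n_people
        sum_pq += p * (1.0 - p)

    return (k / (k - 1)) * (1.0 - sum_pq / var_total)


# Invented illustration: five examinees, four items.
toy = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
toy_reliability = kr20(toy)
```

With real data each row would hold a student's 30 scored responses on one level test.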

An interesting possible explanation suggests itself and is currently being
studied. Despite the fact that the student's score comes from a very short
test, in one very real sense the procedure under exploration logically would
be analogous to taking a single long test and identifying those items of
appropriate difficulty range to span the general accomplishment level of each
individual student. These items are then administered to the student who is
not forced to take items which are either too easy or too difficult for
him. Linking the calibrations of the various test levels into an extended
scale including the items from all levels would make it possible to assign
each student a score resulting from placement on the total extended scale
underlying the entire series of short tests. This would be equivalent to
giving a child a test of, say, 150 items, and then scoring only the 30 items
at the cutting edge of his performance capabilities; all easier items would
(in one sense) be credited to the child's score without the necessity for the
student actually to have dealt with them. If, in fact, this analog proves
sound (and reliability estimates based on extended scale scores will provide
one test of this in the near future) there is every reason to hope that
reliability of these tests will be more than adequate for conventional
educational usage.
Another way of looking at the reliability of these tests deals with the amount
of error involved in a test score (since this defines the confidence band
within which a given score must be interpreted). The error of measurement
inherent in the calibrated scores of these short tests will be discussed in
a later part of the paper but in passing it can be stated that error of
measurement generally falls between one-third and one-half of a logistic scale
unit and it is submitted that this is extremely competitive with many
commercial tests.

Table 2 presents correlations between various parts of level tests 4, 5, and 6
with the totality of the same level tests. Column 1 shows the correlation
between the first 15 items and the total test; column 2, between the second 15
items and the total test; and column 3, the middle 20 items with the total.
Since these are part-whole correlations we would anticipate that they would be
very high and for most usage they would be spurious. It is interesting to note,
however, that the correlation between the middle section of the test and the
total test is sufficiently high that it suggests that practically the same
measurement capability exists within an even shorter test than was initially
studied. This will be followed up in later research and, even if it is not
utilized for measurement of individual student performance, it will have great
implications for quick, easy, and painless program measurement where the goal
involves the estimation of district parameters.

                                Table 2

                   Part-Whole Correlations Involving
                       Level Tests 4, 5, and 6.

                         Items 1-15        Items 16-30       Items 5-25
     Level      N     Correlated with   Correlated with   Correlated with
                        Total Score       Total Score       Total Score

       4       140         .88               .84               .92
       5       200         .89               .83               .94
       6       182         .90               .83               .96

                          K = 15            K = 15            K = 20

Strategies for Placing Students in Appropriate Test Level

A number of possible strategies were considered for placing the student in a
test having an appropriate level of difficulty. Teacher assignment is, of
course, possible. Another procedure would let the student examine all levels
and select the one that looked the most practical to him. Level assignment
could be made on the basis of the child's previous test score. Assignment
could be made in terms of a short screening test which would permit
identifying general level of performance capability.

One of the sub-districts involved in this research adopted this final procedure
and developed a short screening test consisting of seven items, one
representing the average difficulty level of items in each of the level tests.
This screening test was subjected to scalogram analysis to see if it composed
a true unidimensional scale since this would provide the greatest precision in
test assignment. The test did not scale perfectly although it did have a
reproducibility coefficient of .83 which was deemed adequate for practical
usage. In use, the test was given to the students who then scored it
themselves as the teacher read the correct answers. The student then selected
the level of test to be taken in terms of number of items correct on the
screening. Two variations of the strategy were used, each with a sample of the
schools involved. In one group of schools the number of items correct indicated
the level of the tests to be taken. Since it was felt that there was a real
possibility that this would place students in a test where they could be
expected to get anywhere from half to all the items correct, a second variation
asked the students to take the test level that was one higher than their raw
score on the screening test.
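
The reproducibility coefficient mentioned above can be illustrated with a short Python sketch. The paper does not say which error-counting rule was used in the scalogram analysis; the sketch assumes the Goodenough-Edwards rule, in which each response pattern is compared with the ideal Guttman pattern for the same raw score. The data and the function name `reproducibility` are hypothetical.

```python
def reproducibility(responses):
    """Coefficient of reproducibility: 1 - errors / (people * items).

    Assumes Goodenough-Edwards error counting: a response is an error
    when it differs from the ideal Guttman pattern for that raw score.
    """
    n_people = len(responses)
    n_items = len(responses[0])

    # Order items from easiest (most passes) to hardest.
    passes = [sum(row[j] for row in responses) for j in range(n_items)]
    order = sorted(range(n_items), key=lambda j: -passes[j])

    errors = 0
    for row in responses:
        score = sum(row)
        for rank, j in enumerate(order):
            # Ideal pattern: the `score` easiest items passed, the rest failed.
            ideal = 1 if rank < score else 0
            if row[j] != ideal:
                errors += 1
    return 1.0 - errors / (n_people * n_items)


# Invented illustration: five students, three screening items.
screening = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],  # passes a harder item after failing an easier one
    [0, 0, 0],
]
coefficient = reproducibility(screening)
```

A perfectly scaling test yields a coefficient of 1.0; each departure from the ideal pattern lowers it.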

Table 3 presents numerical data about score ranges and Plate 1 graphs the range
and inter-quartile range of test levels 4, 5, and 6. Data are included both for
tests taken "on level" (where the screening test score indicated test level)
and under "level plus one" circumstances (where the child took the next higher
level than was indicated by screening test score). In general, we find that
placement by both strategies resulted in children taking tests somewhat more
difficult than optimum.

                                  Table 3

                      Ranges and Quartile Values for
                         Level Tests 4, 5, and 6.

             Test Placement "On Level"       Test Placement "Level plus One"
   Level    Low    Q1     Md     Q3    High    Low    Q1     Md     Q3    High

     4       2     [?]    [?]   12.5    25      1     ?.8    8.1   11.5    21
     5       0     [?]    8.?   12.7    28      2     6.0    8.0   10.2    21
     6       2     6.8    1?    [?]     2?      0     4.7    7.7   11.2    24

   [?] or ? marks an entry or digit that is illegible in the source.



Generally speaking the perfect "level test", used with students that were
effectively placed, would demonstrate a mean score at, or near, the mid-range
of the test's raw score scale (in this case on or about 15). It would have a
standard deviation such that the distribution of scores would go neither too
high nor too low to provide effective measurement (i.e., the maximum raw score
would not exceed 23 to 25 and the minimum not fall below 5 to 7; this would
indicate a standard deviation in the vicinity of 3 raw score points).
Hopefully the scores would distribute themselves fairly symmetrically although
Rasch calibration makes no assumptions with respect to normal distribution.
Table 3 shows that the total range of scores at every level, both for on-level
and "level plus one" placement, exceeds the range considered optimum
(particularly at the lower end). At most levels there are indications of a
floor effect since the 25th percentile typically falls somewhere between raw
score 5 and 7 with the median around 10 for on-level placement and 8 for
off-level placement. The third quartile (75th percentile) falls between 12 and
14 for on-level placement and 10 and 12 for level plus one.

Examining the difference between quartile 1 and the median, and quartile 3 and
the median, a consistent suggestion of positive skewness emerges. This
skewness, when tested statistically, is substantial and indicates that there is
a real tendency for the scores to pack up near the bottom part of the available
score range with relatively few scores falling above raw score 15. (If
placement were perfect, it will be recalled, the median would be approximately
15 and the quartile points would fall somewhere in the vicinity of 10 and 20.)
This suggests that there is a general tendency for the two screening test
procedures to place the child at a higher test level than that which is
desirable for most adequate measurement.

                                 Plate 1.

              Scaled Score Medians, Inter-quartile Ranges,
              and Total Ranges for Test Levels 3 through 7.
              [Plate not reproduced.]
There are many possible reasons for this and these will not be discussed at this 
time. (For instance, it is entirely possible that students tended to score their 
screening tests in such manner that they gave themselves the benefit of the doubt 
when the correctness of an answer was in question.) Regardless of reasons for 
this over-placement, the data suggests that future use of such screening tests 
should place the student one level lower than his raw score. 

In light of the skewness which appeared in score distributions (which is not
necessarily attributable to inefficiency of test level placement by means of a
screening test) it becomes important to know whether or not such inaccuracies
in level assignment are sufficiently great that they cast doubt upon the
usefulness of the scores, and, if so, what proportion of students are so
affected. It is generally acknowledged that Rasch calibration provides
meaningful scores until one reaches either of the two extremes of the score
scale. When a score represents nearly perfect performance, or near zero
performance, it is acknowledged that measurement (and logistic scaling) breaks
down and no longer provides meaningful scores; this is virtually the same
situation as that which occurs with any other type of test scaling since it
represents the situation where you have no score due either to too high a
floor or too low a ceiling (all you know is that the test did not adequately
measure the student involved and you do not know how much higher or how much
lower his score would have been, given an adequate measuring instrument).

Specific criteria for upper and lower limits of effective score scale will be
described in the last section of this presentation but reference to Table 4
indicates that approximately 18% of students tested on-level received raw
scores below 5 whereas 24% of those tested under the "level plus one" condition
fell within this range. A sufficiently small percentage under either placement
strategy scored above 25 that we need not be concerned with inaccurate
measurement at the upper end of the scale. The fact that so many students
achieved raw scores of 5 or less raises a definite concern and will be explored
in the next section of the paper.

                              Table 4

                Percent of Cases Achieving Raw Scores
                 below 5 on Level Tests 4, 5, and 6.

          Level         Tested             Tested
                      "On Level"       "Level Plus One"

            4             17                 23
            5         [illegible]            19
            6         [illegible]            29

                      Mean = 18.3        Mean = 23.7
Measurement Error

In previous sections of this paper it has been pointed out that measurement
error generally is conservative for these short level tests, falling between
one-third and one-half of a logistic scale point (which would translate to one
and a half to two "standard score scale" points on the Rasch calibration
scale). Table 5 presents the standard error of measurement in logistic scale
terms for certain key raw score values (5, 10, 15, 20, and 25 points).
Examination of the table shows two obvious points. First, measurement error
for raw scores 5 and 25 both are at or near .500 whereas the error for a raw
score of 15 (the median value) is roughly .375. A second point demonstrated by
the table is the fact that there is great consistency in measurement error
from one level test to another among those described in this table (levels 3
through 6).

                              Table 5

              Standard Errors in Logistic Scale Terms
               for Representative Raw Score Values of
                     Level Tests 4, 5, and 6.

          Level                   Raw Scores
                      5       10       15       20       25

            4       .499     .397     .374     .396     .497
            5       .497     .395     .373     .395     .496
            6       .496     .395     .373     .395     .497
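
The pattern in Table 5 follows from the form of the Rasch standard error, which is the reciprocal square root of the test information at the ability corresponding to a given raw score. The sketch below illustrates this in Python. The item layout (30 items evenly spaced over two logits) is an assumption made for illustration only; the actual item difficulties of the level tests are not given in this paper, so the resulting figures should not be expected to match Table 5 exactly.

```python
import math


def p_correct(theta, b):
    """Rasch probability of a correct response (ability theta, difficulty b)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def ability_for_raw_score(raw, difficulties):
    """Ability (in logits) at which the expected raw score equals `raw`.

    Valid only for raw scores strictly between 0 and the number of items.
    """
    lo, hi = -20.0, 20.0
    for _ in range(200):  # bisection: expected score rises with ability
        mid = (lo + hi) / 2.0
        expected = sum(p_correct(mid, b) for b in difficulties)
        if expected < raw:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def standard_error(raw, difficulties):
    """Standard error of the ability estimate at a given raw score."""
    theta = ability_for_raw_score(raw, difficulties)
    information = sum(
        p_correct(theta, b) * (1.0 - p_correct(theta, b)) for b in difficulties
    )
    return 1.0 / math.sqrt(information)


# Assumed layout: 30 items evenly spaced over 2 logits, centered at 0.
items = [-1.0 + 2.0 * i / 29 for i in range(30)]
errors = {raw: standard_error(raw, items) for raw in (5, 10, 15, 20, 25)}
```

Under these assumptions the error is smallest at the middle of the raw score scale and rises symmetrically toward both ends, which is the shape Table 5 reports.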

The marked similarity of data emerging in Table 5 raised the question as to how
much actual difference did occur from level to level and with this in mind the
curve of error relative to raw score was plotted for each level. When the error
was plotted it was found that the curves for all levels were virtually
identical both in shape and in altitude. Plate 2 presents the composite curve
which emerges and which is equally usable for any level from 3 through 6. An
examination of this curve shows that its central portion, roughly from raw
score 10 to 20, is relatively flat but that below 5 and above 25 it rises
rapidly and symmetrically. This confirms the rule of thumb that had tentatively
been adopted which was to the effect that raw scores below 5 and above 25 (for
a 30 item test) should be treated with extreme caution due to probability of
excessive measurement error (since a test was used which was either too easy
or too difficult for the individual). Within raw score ranges of 5 through 25
the short level tests do provide acceptable measurement, and error is within
tolerable limits.

Summary 

In summary it can be said that Rasch calibration has permitted the development
of short achievement tests that are economical in testing time, and can be
developed in a series of difficulty levels to suit student individual
differences. Furthermore, these tests were of adequate reliability for
practical educational measurement when individual students were assigned to
tests of appropriate difficulty level. A variety of test placement strategies
were considered and several were tried. Two formal procedures involving the
use of a pre-test screening tool for level assignment show promise of
effectiveness but in the research described here tended to place many children
in a test which was somewhat too difficult for them. The use of screening
tests still is considered very promising although it is recommended that in
the future criteria for test placement be modified so the students would be
placed one, or perhaps two, levels lower than they were in the field test of
these prototypes. It is further recommended that any students who get raw
scores under 5, or over 25, immediately be retested with a more appropriate
level to forestall the dramatic measurement error increases which occur when
those limits are exceeded.
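
The retesting recommendation can be stated as a simple decision rule. The sketch below is illustrative only: the paper recommends retesting "with a more appropriate level" without saying how far to move, so the single-level step, the function name `retest_level`, and its parameters are all assumptions.

```python
def retest_level(raw_score, level, lowest=1, highest=7, floor=5, ceiling=25):
    """Return the level to retest at, or None when the score is usable.

    Assumes a one-level step; the source does not specify the step size.
    """
    if raw_score < floor and level > lowest:
        return level - 1  # test was too difficult: move down
    if raw_score > ceiling and level < highest:
        return level + 1  # test was too easy: move up
    return None


decisions = [
    retest_level(3, 5),   # under 5 on level 5
    retest_level(27, 5),  # over 25 on level 5
    retest_level(14, 5),  # within the usable range
]
```

A student already on the lowest or highest level has nowhere further to move, so the sketch returns None in those cases as well.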
APPENDIX

Historical Documentation: Development of the
Multi-Level 7th Grade Math Tests

The goal of the project was to develop a procedure for construction of better
survey achievement tests that would be described by three major characteristics:

1 - shortness,

2 - relevance to instructional program, and

3 - difficulty level appropriate to present performance of each individual
    student.

The nature of the goals suggested the appropriateness of Rasch calibration and
as a result of this the project was set up with this procedure in mind.

The resources utilized in the project were many and varied. Personnel came
primarily from Areas II and III of the Portland Public School District with a
great deal of consultation from the Central Evaluation Department of the
Portland school system and a certain amount of involvement by the Multnomah
County I.E.D. and the Metropolitan Area Testing Program. The planners and
organizers of the project were the persons responsible for evaluation and
measurement in their respective Areas (II and III). Data processing
consultation and actual calibration runs were carried out by the Central
Evaluation Department utilizing the computer center at the Bonneville Power
Administration. The initial project (the prototype establishment of
multi-level achievement tests, and the testing of the appropriateness of Rasch
calibration to public school curricula) involved a file of mathematics items
which were available and which had been written for 7th grade usage. This file
of mathematics items had been written two years previously for the purpose of
building a conventional survey mathematics achievement test for grade 7 usage
and had undergone conventional, traditional item analysis. This item pool was
accepted as appropriate for the present project due to the nature and content
of the items, to its availability, and to the fact that item analysis data
were available for each single item.
Since individual items had previously been placed on cards, along with all item
analysis data, the first step of the project itself was to go through the items
carefully and discard any weak or ineffective items as judged by conventional
item analysis. At this time, all items with difficulty levels suggesting the
possibility of chance response (the very difficult items with percents pass of
20 and below) were tentatively rejected. (Further consideration suggested that
items which had proved too difficult for general 7th grade usage might be very
appropriate for the exceptionally able 7th grader and the 8th or 9th grader.
For this reason these items were considered as a separate sub-pool but were
retained in the project.)



There were 200 items in the major pool. They were arranged in order of
difficulty and divided into four subsets. This gave one subset of very easy
items (labeled W), one group of moderately easy items (X), one group of
moderately difficult items (Y), and a group of difficult items (Z). In each
case 16 to 20 items were shared between adjacent subsets (comprising the more
difficult items of the easier of the two levels and the least difficult items
of the other level). This was to permit the eventual linking of all four
levels into one extended scale after the completion of Rasch calibration.

The difficult items were also divided into subsets with half of the more
difficult items going into one test (called D-1) which shared some items with
the link between W and X (Illustration I). The other half of the difficult
items (D-3) shared items with the link between Y and Z and half of the contents
of each group (D-1 and D-3) were combined with items from the X-Y link and were
designated as D-2. This provided the capability of calibrating items in each of
the seven subsets (W, X, Y, Z, D-1, D-2, and D-3) and also permitted linking
the various subsets together so that an extended calibration scale could be
developed which would encompass the total range from the easiest item in level
W to the most difficult items in the D series.
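
The role of the shared items in linking can be sketched as follows. The paper does not describe the linking computation itself; the sketch assumes the common Rasch practice of shifting one form's difficulties by the mean difference observed on the shared items. The item names, difficulty values, and function names are hypothetical.

```python
def link_constant(base, other):
    """Mean difficulty difference (base minus other) over shared items."""
    common = sorted(set(base) & set(other))
    if not common:
        raise ValueError("forms share no items and cannot be linked")
    return sum(base[item] - other[item] for item in common) / len(common)


def link_onto_base(base, other):
    """Place the items of `other` on the scale of `base`."""
    shift = link_constant(base, other)
    linked = dict(base)
    for item, difficulty in other.items():
        # Shared items keep their base-form value; new items are shifted.
        linked.setdefault(item, difficulty + shift)
    return linked


# Invented calibrations (logits), each centered within its own form.
form_w = {"a": -1.0, "b": -0.5, "c": 0.0, "d": 0.5}
form_x = {"c": -1.0, "d": -0.5, "e": 0.0, "f": 1.0}  # shares c and d with W

shift_x_to_w = link_constant(form_w, form_x)
extended_scale = link_onto_base(form_w, form_x)
```

Repeating the step along the chain of forms would place every item on one extended scale, which is the purpose the shared items served.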

Mockup booklets were prepared for each of the seven trial tests with the
previously mentioned labels (W through D-3). After editorial checking of the
mockup copies, booklets and administration of the "tests" were scheduled to
coincide with regular "year-end" testing programs. This would permit
developing a calibration scale for each trial test, linking them all together,
and developing an extended scale to tie all of the items to one continuous
underlying metric.

Using the basic program for Rasch calibration developed by Wright and
Panchapakesan, with modifications and improvements developed by Dr. Fred
Forster of the Portland Public Schools Evaluation Department, the items in the
trial tests were calibrated after administration to 7th grade students (in the
case of forms W, X, Y, and Z) and unusually able 7th grade students with a
mixture of 8th and 9th grade students (in the case of D-1, D-2, and D-3).
Following the initial calibration, the operating characteristics of the items
were re-examined and all items demonstrating weakness under Rasch calibration
were eliminated (criteria for elimination of items were mean square fit in
excess of 2.5, and "item-total score" correlations below .25). Approximately
25 per cent of the items in the initial pool were rejected.
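
The two elimination criteria can be written as a simple filter. In the sketch below the thresholds come from the text, while the item statistics and the names `keep_item` and `item_stats` are invented for illustration; how the fit and correlation values themselves were computed by the calibration program is not shown here.

```python
def keep_item(mean_square_fit, item_total_r, max_fit=2.5, min_r=0.25):
    """Apply the stated criteria: reject fit above 2.5 or correlation below .25."""
    return mean_square_fit <= max_fit and item_total_r >= min_r


# Hypothetical statistics: item -> (mean square fit, item-total correlation).
item_stats = {
    "item_01": (1.1, 0.42),
    "item_02": (2.9, 0.38),  # rejected: fit in excess of 2.5
    "item_03": (0.9, 0.18),  # rejected: correlation below .25
    "item_04": (2.5, 0.25),  # retained: exactly at both limits
}

surviving = [name for name, stats in item_stats.items() if keep_item(*stats)]
```

Items sitting exactly on a limit are retained here because the text rejects only values "in excess of" 2.5 and "below" .25.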

Following this editing of the item pool the surviving items were reorganized,
links between the various trial tests were established and verified, and one
extended scale was developed to cover all of the items which survived in the
total pool. At this time the total pool numbered 200 items. (The weak items
which were eliminated were balanced by the items in D-1, D-2, and D-3 which
proved to be usable.)



When the item pool had been refined and all weak items had been deleted, it was
possible to start assembling multi-level tests to fit the goals of the project.
A number of questions had to be answered concerning test length, range of
difficulty to be spanned by any individual test, and degree of overlap between
adjacent tests. Due to lack of precedents it was necessary to arrive at some
arbitrary decisions.

In order to keep testing time within reasonable limits, and to prepare packages
that would fit into the conventional instructional day, it was decided to plan
tests that would be manageable within one classroom period. Since the items
were basically in multiple choice format this indicated a length somewhere
between 25 and 40 items; a length of 30 items was selected. This would provide
adequate leeway for distribution and collection time and still permit a child
to attempt all of the items in the test.

The range of difficulty to be covered within any particular test presented
another kind of problem since there had been no prior experience (locally or
nationally) on which to build. The range of the total item pool (expressed in
terms of scaled scores closely resembling conventional standard scores) was
approximately 25 to 30 scaled score units. It was decided to use a test width
of five score units, with an overlap of two units between adjacent levels. This
permitted the development of seven different test levels ranging from very easy
to very difficult.
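
The arithmetic behind the seven levels can be checked with a short sketch. A width of five units with an overlap of two means each level starts three units above the one below it. The starting value of 0 is an arbitrary assumption (the actual scaled score values are not given), so the ranges below are offsets from the bottom of the easiest level.

```python
def level_ranges(n_levels=7, width=5, overlap=2, start=0):
    """Return (low, high) scaled-score bounds for each level, easiest first."""
    step = width - overlap  # distance between the starts of adjacent levels
    return [
        (start + i * step, start + i * step + width) for i in range(n_levels)
    ]


ranges = level_ranges()
total_span = ranges[-1][1] - ranges[0][0]
```

Under these assumptions the seven levels together span 23 scaled score units, which fits within the pool range of approximately 25 to 30 units.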

Since the initial item pool had been described in terms of the three
conventional arithmetic subtests (computation, concepts, and problem solving)
the pool was first subdivided into sub-pools representing these categories.
Once the items had been listed in terms of scaled score value throughout each
subtest it was possible to divide the pool into strata representing each
proposed level test, identifying those items at each level which would be
unique to that level as well as those that would overlap with the next higher
and lower levels. With the item pool mapped in this manner it was possible to
identify those specific items which would meet the criteria of calibrated scale
value and content to fill the blueprint for each level test. At most levels
there were more than enough items to fill the blueprint so that it was possible
to select specific items within each level to provide the best balance of
coverage throughout the range to be covered by that particular test. With the
very lowest level there was a shortage of usable problems dealing with
concepts. To fill this gap supernumerary items were selected from computation
or problem solving. At the most difficult end of the scale there was a less
pronounced but similar shortage in computation problems having a high degree of
difficulty. With these two exceptions it was possible to fill the blueprint at
each level with items that had been calibrated and demonstrated effective.

Master pages (to serve in the production of offset printing plates) were then
prepared for each level with the items arranged in order of difficulty and
numbered serially from 1 to 30. All pages representing one specific level were
numbered in the upper right-hand corner with a large numeral indicating the test
level so that there would be no confusion on the part of a student selecting the
level he/she was to use. The decision was made to print all seven levels in one
booklet so that the teacher would not need to worry about interfacing individual
pupils with separate booklets at the right level. This represented a calculated
risk because it did create the possibility that students might inadvertently take
an inappropriate level of the test. This was not seen as a serious problem due
to the fact that each level was sufficiently similar to the levels above and below
that any three adjacent levels should provide effective measurement for a given
child and it was considered entirely possible that a child could get as much as
two levels above or below that which was optimum and still be effectively measured.
(It is true that with a disarticulation of two levels the child might be faced
with a test that was either very difficult, or very easy, but until the point is
reached where a child gets virtually a zero score on the one hand or a perfect
score on the other, Rasch procedures will still develop a usable and effective
score.)
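The parenthetical remark above can be illustrated with a minimal sketch of how
a Rasch ability estimate might be obtained from a raw score once item
difficulties are known. This is not the scoring program used in the project;
the function name rasch_ability, the Newton-Raphson procedure, and the item
difficulties in the usage note are assumptions chosen for illustration only.

```python
import math


def rasch_ability(raw_score, difficulties, tol=1e-6, max_iter=50):
    """Maximum-likelihood ability (in logits) for a raw score on items
    whose Rasch difficulties are already calibrated.

    Returns (ability, standard_error). Zero and perfect scores raise
    ValueError because they have no finite estimate under the model.
    """
    n = len(difficulties)
    if not 0 < raw_score < n:
        raise ValueError("zero and perfect scores have no finite Rasch estimate")

    # Starting value: log-odds of the raw score, centered on mean difficulty.
    theta = sum(difficulties) / n + math.log(raw_score / (n - raw_score))

    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        expected = sum(probs)                       # expected raw score at theta
        information = sum(p * (1.0 - p) for p in probs)
        step = (raw_score - expected) / information
        step = max(-1.0, min(1.0, step))            # damp very large steps
        theta += step
        if abs(step) < tol:
            break

    # Standard error is the inverse square root of the test information.
    probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
    information = sum(p * (1.0 - p) for p in probs)
    return theta, 1.0 / math.sqrt(information)
```

Called with, say, 30 hypothetical difficulties spaced evenly from -1.5 to 1.5
logits, the function returns a finite estimate for every raw score from 1 to
29, with a standard error that grows as the score approaches either extreme.
That pattern is consistent with the recommendation in the abstract that
students scoring under 5 or over 25 be retested at a more appropriate level.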

Another potential problem which was carefully considered dealt with the fact that 
all levels would be numbered from one through 30 making it possible to use the 
same answer sheet for all. This made it mandatory that care be exercised in
proctoring to make sure that each child identified the test level taken on the answer
sheet. Test level identification offered two major alternatives: a coding block
where the child could indicate the level of the test taken (as is done in
indicating grade level or sex), or a box where the child could write an Arabic numeral
which was not machine scannable but which then could be transcribed in machine
scannable format by a clerical operation between test administration and scoring.
In the field testing stage this second alternative was selected.

At this stage of the project, seven levels of a 7th grade test existed. Each
level contained 30 items equally balanced between computation, problem solving,
and concepts, with the levels ascending monotonically in difficulty with
substantial overlap between adjacent levels. These levels had been printed in one
multi-level booklet with the pages comprising each separate level carefully and obviously
identified. A machine scannable answer sheet was selected (with the same answer
sheet usable at any level of the test).



A decision was now necessary with respect to the way in which the appropriate
level for any particular child could be determined. Three general procedures
came to mind and were considered. These were: (1) teacher assignment to level
based on judgment of pupil's present capability in mathematics; (2) arbitrary
assignment of children to the median level (level 4) unless there were obvious
reasons why the child should take a higher or lower level of the test; (3)
assignment of test level by means of a pre-test screening device.

In Area II of the Portland, Oregon, school system where the multi-level test was
field tested across the entire 7th grade (comprising roughly 1500 students in 30
elementary schools) it was decided to use a short screening test. The screening
test was developed by selecting, from among the unused items in the calibrated
pool, one item from the mid-difficulty range of each subtest. It was hoped that
these items would form a unidimensional scale in the Guttman sense and for this
reason a scalogram analysis of the performance of the items was carried out. The
items did not form a true scale although the reproducibility was approximately
.80. Despite the fact that a unidimensional scale was not found it was decided
to use the screening test by asking the children to answer the items on the test,
score their own paper as the teacher read the answers, and then use their score
on the screening test as an indication of which level should be taken. Two
strategies were tried out, each with half of the schools in Area II. The first
strategy was to have the child take the raw score on the screening test as an
indication of the level test to be taken (e.g., if a child got four of the seven
items on the screening test correct, the child took level 4). The second
alternative was built on the rationale that a score of 4 on the screening test indicates
that the child ought to be able to handle practically everything on level 4 and,
in fact, would be more appropriately served by taking level 5, one level higher
than the screening test raw score would indicate. As a result, the second
alternative was to use the screening test in precisely the same manner as has been
previously described but then instruct the children to add 1 to their screening test
score to identify the level test which was to be taken.
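The two placement strategies just described amount to a simple rule, sketched
below for illustration. The function name assign_level and its offset
parameter are hypothetical. The clamping of results to levels 1 through 7 is
also an assumption: the paper does not state how a screening score of 0, or a
score of 7 under the add-one rule, was handled.

```python
NUM_LEVELS = 7  # seven test levels; the screening test has seven items


def assign_level(screening_score, offset=0):
    """Map a screening-test raw score (0 to 7) to a test level.

    offset=0 is the first strategy (level equals raw score);
    offset=1 is the second strategy (level equals raw score plus one).
    Clamping to the range 1..NUM_LEVELS is an assumption, not taken
    from the paper.
    """
    if not 0 <= screening_score <= NUM_LEVELS:
        raise ValueError("screening score must be between 0 and 7")
    return max(1, min(NUM_LEVELS, screening_score + offset))
```

Under this sketch a child with four screening items correct is assigned
level 4 by the first strategy and level 5 by the second. A negative offset
would express the lower placement that the abstract recommends for future use.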

The field testing was carried out, the answer sheets were sorted by level of test
taken and the test level coded on the answer sheet. They were then sent to the
Multnomah County Intermediate Education District data processing department where
they were scored by means of a Digitek optical scanner and tapes were then
forwarded to Dr. Forster in the Portland Public Schools Evaluation Department who
performed the Rasch calibrations utilizing the data processing equipment at the
Bonneville Power Agency.